Abstract. Polymers can be modeled as open polygonal paths and their closure
generates knots. Detection of knotted proteins is currently achieved via high-throughput
methods based on a common framework that is insensitive to the handedness of knots.
Here we propose a topological framework for the computation of the HOMFLY
polynomial, a handedness-sensitive invariant. Our approach couples a multi-component
reduction scheme with the polynomial computation. After validation on
tabulated knots and links, the framework was applied to the entire Protein Data Bank
along with a set of selected topological checks that allowed us to discard artificially
entangled structures. This led to an up-to-date table of knotted proteins that also
includes two newly detected right-handed trefoil knots in recently deposited protein
structures.

The application range of our framework is not limited to proteins: it can be extended
to the topological analysis of biological and synthetic polymers and, more generally, to
arbitrary polygonal paths.

1. Introduction

The topological study of biological polymers has led to important insights into their
structural properties and evolution [1, 2]. From a topological point of view, polymers
can be naturally modeled as sequences of 3D points, i.e. open polygonal paths. Their
closure generates classical objects in topology called knots. The simplest nontrivial knot
is the trefoil knot, illustrated in Figure 1A. The characterization of knotted proteins, due
to their close structure-function relationship and reproducible entangled folding, is a
subject of increasing interest in both experimental and computational biology.

Knot investigation was initially fostered by the discovery of knotted circular
single-stranded DNA [3] and has been followed by the study of the underlying enzymatic
mechanisms [4, 5] and, more recently, by the description of the topological organization
and packing dynamics of the bacteriophage P4 genome [6, 7].

Figure 1: A knot diagram and illustration of the Conway skein triple. (A) Three-
dimensional polygonal representation of the trefoil knot (in red) and its planar diagram
(in black). Two red spheres on the knot mark the 3D points X_1 and X_2 projecting
down to x on the planar diagram along the brown arrow. (B) The Conway skein triple
is composed of three oriented diagrams that are the same outside a small region, where
they look like the illustrated L_+, L_- and L_0. To define the oriented sign of a crossing,
approach it along the underpass in the direction of the orientation: if the overpass
orientation runs from left to right, the oriented sign is +1, and -1 otherwise.

Despite these great advances in knotted DNA studies, we are only beginning to go
deeper into the characterization of protein knots and the understanding of their biological
role. After the pioneering work of Mansfield [8] and the definition of topological
descriptors for the analysis of protein symmetries and protein classification [9-11],
the detection of knots in proteins was boosted by Taylor's work [12]. The exponential
growth of the total number of structures deposited into the Protein Data Bank (PDB,
http://www.pdb.org) [13] requires dedicated computational high-throughput methods
able to deal with a large amount of data. These methods combine a structure
reduction scheme of a protein backbone model with the computation of a knot invariant,
the Alexander polynomial [9, 15-17]. Hereinafter, with the term reduction we refer to a
stepwise deletion of a certain number of points from the original structure (endpoints
excluded) that preserves its ambient isotopy class.

The most established reduction algorithm is the KMT reduction scheme. KMT owes
its name to the different algorithms proposed by Koniaris and Muthukumar [18] and
Taylor [12, 19]. Since the use of this acronym has engendered some confusion in the
literature about which algorithm is actually being used, we will explicitly refer to them
by authors' names. Globally, these methods are based on the concept of elementary
deformation [20, 21], which consists in the replacement of two sides of a triangle
with the third, provided that the triangle is empty. In particular, while Koniaris
and Muthukumar's algorithm essentially reproduces the ideas of Alexander-Briggs and
Reidemeister, in Taylor's algorithm (which Taylor himself considers a smoothing
algorithm) the elementary deformation is done in steps that progressively smooth the
chain at the cost of introducing points not belonging to the protein backbone; the
edge replacement depends on some selected conditions [19] chosen to prevent numerical
problems.

Figure 2: Knots met in proteins. Illustration of the knots found in proteins, labeled
according to Rolfsen names. U: the simplest knot, the unknot. 3_1: the trefoil knot,
which has three crossings, and its mirror image, denoted by the *. 4_1: the figure-eight
knot is the only knot with four crossings. 5_2: the three-twist knot has five crossings.
6_1: the Stevedore's knot, the most complex knot detected in proteins.

Once the reduction has been accomplished, knot type identification can be performed.
This can be done either by visual inspection or by computing a polynomial invariant.
Being easy to compute, the Alexander polynomial represents the current default choice.
This is also supported by the evidence that protein knots detected to date are the
simplest ones, as illustrated in Figure 2.

Unfortunately, the Alexander polynomial does not distinguish a knot from its mirror
image. Thus, for instance, left- and right-handed trefoil knots share the same polynomial.
Instead, more powerful invariants are able to determine knot chirality.
Whereas defining the handedness of the simplest knot types is straightforward, its
extension to more complex knots requires care. However, for the purpose of this
article, a knot is chiral if its mirror image and the knot itself belong to two different
ambient isotopy classes and it is achiral otherwise. We define the handedness of knots
according to [22], adopting the conventional values reported in the Atlas of Oriented
Knots and Links (http://at.yorku.ca/t/a/i/c/31.htm).

As far as proteins are concerned, the handedness of protein knots has been only
partially addressed so far.

Taylor points out the existence of both right- and left-handed trefoil knots, with a
clear right-handed preference [2]. This hypothesis was supported by the finding that
all trefoil knotted proteins belong to the SCOP [23] βα class, where an intrinsic right-
handed preference for βαβ unit connections exists. The only left-handed trefoil knot
was detected in the ubiquitin C-terminal hydrolases (1cmx), considered afterwards as an
incomplete five crossings knot. However, by considering individual fragments the knot
vanishes. A more recent work that removed sequence redundancy intriguingly highlights
a global 5 to 3 balance between right-handed and left-handed knots, not suggesting a
bias for one of the two hands [24].

In order to compute invariants able to cope with knot chirality, here we propose a novel
topological framework to compute arbitrary skein polynomials. A skein polynomial P
respects the skein relation:

    a L_+ - b L_- = c L_0    (1)

which is an algebraic relation connecting the configurations in a Conway skein triple [25]
(see Figure 1B), namely it verifies

    a P(L_+) - b P(L_-) = c P(L_0)

where the coefficients a, b, c have to satisfy some relations. For instance, the choice
b = a^{-1}, c = z leads to the HOMFLY polynomial P(a, z) [26]. By further specializing
a = t^{-1} and z = t^{1/2} - t^{-1/2} one obtains the Jones polynomial V(t), whereas setting
a = 1 and z = t^{1/2} - t^{-1/2} leads to the Alexander polynomial Δ(t). As far as proteins
are concerned, the handedness of protein knots was previously addressed by King et
al. [27], relying on the computation of the Jones polynomial.
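As a small worked example of how relation (1) is used, the following sketch derives the HOMFLY value of the two-component unlink. It assumes the normalization P(unknot) = 1, which is standard but is not stated explicitly above, and it writes U_2 for the two-component unlink, a symbol introduced here only for illustration.

```latex
% Skein relation (1) with b = a^{-1}, c = z (HOMFLY), assuming P(\text{unknot}) = 1.
% Take a one-crossing diagram of the unknot (a single kink):
% L_+ and L_- are both unknots, while L_0 is the two-component unlink U_2.
\begin{align*}
a\,P(L_+) - a^{-1}\,P(L_-) &= z\,P(L_0) \\
a \cdot 1 - a^{-1} \cdot 1 &= z\,P(U_2) \\
P(U_2) &= \frac{a - a^{-1}}{z}
\end{align*}
```

Setting a = 1 gives P(U_2) = 0, in agreement with the fact that the Alexander polynomial vanishes on split links.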

Although this appears to be enough to define the chirality of the currently detected
knotted proteins, the HOMFLY polynomial is more powerful. For instance, whereas
the Jones polynomial is the same for knots 10_22 and 10_35 of the Rolfsen table, the
HOMFLY polynomial is able to discriminate them. In the realm of our method, other
choices lead to the Vassiliev knot invariants [28, 29], considered for instance in [30].

Generally, the skein relation does not preserve the multiplicity of a link. For example,
if L_+ is a knot, L_0 will be a two-component link. The recursion of the skein relation
together with the values of the given polynomial on the unknot allows the reconstruction
of the polynomial of any given link. Therefore, the complexity of the polynomial
computation grows exponentially with the number of crossings to be processed. Our
algorithm relies on the iteration of the skein relation and explicitly constructs the Conway
skein triple associated with a given crossing by a stepwise insertion of auxiliary points.

In order to deal with multi-component links and speed up computations, the polynomial
computation is preceded by the application of a structure reduction scheme, which we
call MSR (Minimal Structure Reduction). The MSR algorithm exploits the interplay
between the 3D structure and the corresponding 2D planar diagram of a polygonal
path and basically relies on a 3D operation, namely the Generalized Reidemeister Move
(GRM). While the Alexander-Briggs method intrinsically removes at most one point at
each step, a GRM does not necessarily operate locally, usually leading to a dramatic
reduction of the number of points in few steps.

The effectiveness and robustness of the proposed framework were initially evaluated on
tabulated knots and links, leading to a HOMFLY polynomial repository along with
knot orientation details. We then applied our methods to protein structures. By
screening the entire PDB (version of November 8, 2010), we obtained an up-to-date
table of knotted structures that also includes two newly detected right-handed trefoil
knots.

2. Methods 

2.1. Basic concepts and definitions 

To make this article self-contained, herein we introduce and briefly describe basic
concepts and definitions.

• Polygonal paths A pair (P, S), where P = {P_1, ..., P_N} is a collection of points in
R^3 and S = {S_0, S_1, ..., S_K} is an ordered subset of [0..N] (the integers in [0, N])
with S_0 = 0, S_K = N, determines a collection of K polygonal paths in R^3 as follows:
the k-th path (or component) is generated by connecting the points indexed by
[S_{k-1}+1..S_k]. The edges of the polygonal paths are the oriented segments
P_i P_{i+1} with i ∈ E = [1..N-1] \ S.
A collection of polygonal paths (P, S) in R^3 is simple if each edge of the path
intersects precisely the previous and the next edge at the endpoints [31].

• Polygonal link A collection L = (P, S) of simple polygonal paths is a polygonal
link. The K = K(L) components of L are not necessarily closed. For the sake of
convenience, a subpath will be defined by indexing L with square brackets.

• Regular projection A projection π : R^3 → R^2 of a polygonal link L is regular if the
following conditions are satisfied:

(i) The image π(L) has at most a finite number of double points (crossings).

(ii) No vertex is a double point.

A link diagram is a regular projection of the link whose graphical representation
adopts solid edges and gaps to indicate overcrossings and undercrossings
respectively (see Figure 1A). With a slight abuse of language we will also call
under/over crossings the points in R^3 that project down to a crossing in R^2.

• Intersection signs Given two sets of edges A and B we can compute the intersection
matrix I = I(A, B) by setting

    I_ij =  0   if A_i and B_j do not intersect transversally
    I_ij = +1   if A_i lies over B_j                             (2)
    I_ij = -1   if A_i lies under B_j

If A = B we get an antisymmetric square matrix and we can simplify the notation
to I(A). The definition of intersection signs is detailed in Text S1.

• Minimal structure A minimal structure for a polygonal link L is a nested sequence
of subsets of L

    L ⊃ L_1 ⊃ ... ⊃ L_n

that cannot be extended. Each inclusion corresponds to a Generalized Reidemeister
Move, described below.
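The intersection signs of definition (2) can be illustrated with a minimal Python sketch. It assumes that the projection is taken along the z axis and that "over" means a larger z coordinate at the crossing; the function names `crossing_sign` and `intersection_matrix` and the tolerance `eps` are hypothetical and are not taken from the original implementation (the authors' definition is the one in Text S1).

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Edge = Tuple[Point3, Point3]


def crossing_sign(a: Edge, b: Edge, eps: float = 1e-12) -> int:
    """Return 0, +1 or -1 for edges a and b projected along the z axis.

    0  : the projections do not cross transversally in their interiors
    +1 : a lies over b at the crossing
    -1 : a lies under b at the crossing
    """
    (a0, a1), (b0, b1) = a, b
    ux, uy = a1[0] - a0[0], a1[1] - a0[1]   # projected direction of a
    vx, vy = b1[0] - b0[0], b1[1] - b0[1]   # projected direction of b
    wx, wy = b0[0] - a0[0], b0[1] - a0[1]
    det = ux * vy - uy * vx
    if abs(det) < eps:                      # parallel projections
        return 0
    # Solve a0 + s*u = b0 + t*v in the plane.
    s = (wx * vy - wy * vx) / det
    t = (wx * uy - wy * ux) / det
    if not (eps < s < 1 - eps and eps < t < 1 - eps):
        return 0                            # no interior crossing
    za = a0[2] + s * (a1[2] - a0[2])        # height of a above the crossing
    zb = b0[2] + t * (b1[2] - b0[2])        # height of b above the crossing
    if abs(za - zb) < eps:
        return 0                            # edges touch in 3D (not simple)
    return 1 if za > zb else -1


def intersection_matrix(A: Sequence[Edge], B: Sequence[Edge]) -> List[List[int]]:
    """Matrix I(A, B) with entries I_ij = crossing_sign(A_i, B_j)."""
    return [[crossing_sign(a, b) for b in B] for a in A]
```

With A = B the result is antisymmetric, because swapping the two edges swaps the roles of over and under.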

2.2. Structure reduction algorithm 

Our reduction algorithm MSR iteratively exploits the subroutine GRM, which performs
a Generalized Reidemeister Move according to the following scheme:

Step1: Move candidate selection, namely a subpath M of L.

Step2: Move contraction L', which is the provisional replacement in L of M with the
segment M' connecting the endpoints of M.

Step3: Check that L and L' belong to the same ambient isotopy class. If so, the
replacement described in Step2 becomes effective.

While the first two steps are trivial, Step3 requires the study of the intersections of the
move candidate M with the remainder C of L. M is characterized by its initial and
final edge indices, respectively b_M and e_M, and belongs to a specified component, say
m, of L.

The complement C can be split into C_out, the link components different from m, and
C_in, the open link with at most two components given by L[(S_{m-1}+1..b_M)] and
L[(e_M+1..S_m)]. Let sign(M) be the set of signs of I(M, C) and analogously let
sign(M') be the set of signs of I(M', C), where M' is the segment connecting the
endpoints of M.

The topological check in Step3 requires the evaluation of the three following conditions:

(T) M is ascending or descending (Triviality of M).

(S) sign(M) contains at most one element (Separability of M from C).

(C) The set sign(M) ∪ sign(M') contains at most one element (Concordance of M
and M' with respect to C).

If the TSC conditions hold, we call the replacement of M with M' (and vice versa) a
Generalized Reidemeister Move. GRMs generate an equivalence relation on polygonal
links. An example of an admissible move is illustrated in Text S2.

Given a polygonal link L, its intersection matrix I_L = I(L) and the move initial index
b, the GRM algorithm performs the following operations:

        L_out = L
        I_out = I_L
        e = b + 1
     5: while (e ∈ E) do
            M = L[[b..e]]
            Check Condition (T)
            if (T) False then
                Go to Exit
    10:     end if
            Check Condition (S)
            if (S) False then
                Go to Exit
            end if
    15:     Compute the vector r = I(M', C)
            Construct I_{L'} from I_L and r
            Check Condition (C)
            if (C) False then
                Go to Exit
    20:     end if
            L_out = L'
            I_out = I_{L'}
            e = e + 1
        end while
    25: Exit
        L = L_out
        I_L = I_out
        return L and I_L



The key point of the algorithm is the construction of the intersection matrix I_{L'}
from I_L (line 16) simply by replacing the rows and columns [b..e] of I_L with the vectors
+r and -r respectively. Notably, this procedure greatly reduces the computational cost
with respect to an explicit matrix computation.
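The row and column replacement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: edges are indexed from 0, the matrix is a plain list of lists, and `r` holds the signs of the contracted segment against the edges that survive the move, in their original order. The function name `contract_intersections` is hypothetical.

```python
from typing import List, Sequence


def contract_intersections(I: Sequence[Sequence[int]], b: int, e: int,
                           r: Sequence[int]) -> List[List[int]]:
    """Collapse edges b..e (inclusive, 0-based) of an antisymmetric matrix I.

    The block of rows/columns b..e is replaced by a single row +r and a
    single column -r; all other entries are copied unchanged.
    """
    n = len(I)
    keep = [i for i in range(n) if i < b or i > e]   # surviving edges
    if len(r) != len(keep):
        raise ValueError("r must have one entry per surviving edge")
    pos = b                      # index of the new edge in the reduced matrix
    m = len(keep) + 1
    out = [[0] * m for _ in range(m)]
    for ni, i in enumerate(keep):
        ri = ni if ni < pos else ni + 1
        for nj, j in enumerate(keep):
            rj = nj if nj < pos else nj + 1
            out[ri][rj] = I[i][j]        # untouched part of the matrix
        out[pos][ri] = r[ni]             # new row: +r
        out[ri][pos] = -r[ni]            # new column: -r
    return out
```

Only the entries of `r` have to be computed geometrically, so the cost of one update is linear in the number of surviving edges rather than quadratic.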

We are now ready to introduce MSR. Given a polygonal link L and an iteration limit n
(suitable to achieve a partial reduction), MSR operates as follows:

        Compute I_L = I(L)
        l := #L (Dynamic assignment)
        i = 1
        while (i < n) do
     5:     if l = 2K (where K is the multiplicity of L) then
                Go to Exit
            end if
            p = #L
            b = 1
    10:     while (b < l - 1) do
                (L, I_L) = GRM(L, I_L, b)
                b = b + 1
            end while
            if p = l (Reached minimal structure) then
    15:         Go to Exit
            end if
            i = i + 1
        end while
        Exit
    20: return L
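The control flow of MSR can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the move is passed in as a hypothetical callable `grm(points, b)` that returns the (possibly shortened) point list, the intersection matrix bookkeeping is omitted, and indices are 0-based. The helper `drop_collinear` is a toy stand-in that only removes a point lying on the straight segment between its neighbours; it does not perform the TSC checks.

```python
from typing import Callable, List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Move = Callable[[List[Point3], int], List[Point3]]


def msr(points: Sequence[Point3], n_components: int, grm: Move,
        max_iter: int) -> List[Point3]:
    """Sweep the structure with `grm` until nothing changes or a limit is hit."""
    pts = list(points)
    for _ in range(max_iter):
        if len(pts) == 2 * n_components:     # every component is a single edge
            break
        before = len(pts)
        b = 0
        # len(pts) is re-read at every pass (the "dynamic assignment" of l).
        while b < len(pts) - 2:
            pts = grm(pts, b)
            b += 1
        if len(pts) == before:               # no move succeeded: minimal structure
            break
    return pts


def drop_collinear(pts: List[Point3], b: int) -> List[Point3]:
    """Toy move: remove pts[b+1] if it lies between pts[b] and pts[b+2]."""
    p0, p1, p2 = pts[b], pts[b + 1], pts[b + 2]
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p1[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    dot = sum(ui * vi for ui, vi in zip(u, v))
    if all(abs(c) < 1e-12 for c in cross) and dot > 0:
        return pts[:b + 1] + pts[b + 2:]
    return pts
```

Passing the move as a callable keeps the driver loop independent of how the topological check is implemented, which makes the loop easy to test in isolation.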

2.3. Skein polynomials computation 

In the following, the interplay between three and two dimensions plays a fundamental
role and it is realized through the standard projection π_z. Since π_z restricted to L is
invertible up to a finite number of double points, we denote with an uppercase letter
objects of L and with the corresponding lowercase letter their projection. Preimages
of double points are distinguished by subscripts. Obviously, any subpath in
the projection has a unique lift to L and therefore in the following we adopt a
two-dimensional description.

Given a polygonal oriented link, we consider two oriented edges E_1 = P_1P_2 and
E_2 = P_3P_4 such that their projections e_1 = p_1p_2 and e_2 = p_3p_4 cross at a point x. For
the sake of convenience we assume that E_1 lies under E_2 and we respectively denote by
X_1 and X_2 their points projecting down to x. The edges e_1 and e_2 give rise to a skein
configuration of type + or -.

We implemented the skein relation on the 3D structure of L by construction of the
corresponding skein configurations L_sw and L_0. With L_sw we refer to the switching of the
crossing under consideration. Our algorithm performs the following steps (illustrated in
Figure 3):

Step1: Construct an empty quadrilateral q containing x whose vertices belong to e_1 and
e_2.

Step2: Rotate q in 2D to get r and provisionally change L getting L_r (by means of the
just introduced lift operation).

Step3: Check that L and L_r are topologically equivalent.

Figure 3: Example of geometric construction of the skein configurations. (A)
Figure-eight polygonal knot diagram. Knot orientation and the crossing x between the
edges e_1 and e_2 are shown. (B) A clean quadrilateral q around x is shown in red.
(C) The rotated quadrilateral r (solid blue lines) is obtained by rotating q (dashed
red lines) around the z axis. (D) Triangles to be analyzed in the topological check are
shaded in green. The points q and r are reported respectively in red and blue. (E)
The L_sw configuration, with the path P_1R_1X_swR_2P_2 highlighted in black. (F) The L_0
configuration. Solid lines highlight the new connections P_1R_1R_4P_4 (in red) and
P_3R_3R_2P_2 (in blue).

2.3.1. Quadrilateral Construction The edges e_1 and e_2 are each divided into two cut
edges by the crossing x (see Figure 3A). We construct a quadrilateral with vertices on the
four cut edges such that it contains no other edges of the polygonal link projection
(clean quadrilateral, see Figure 3B). We consider the four parametric half lines with
parameters k_i, i ∈ [1..4], leaving from x along the four cut edges

    r_i(k_i) = x + k_i (p_i - x)

For a given value of the parameter vector k we get the vertices of a quadrilateral q = q(k).
The vertices follow the order 1, 3, 2, 4. To construct a clean quadrilateral we proceed as
follows:

(i) Initialize k by setting each k_i = 0.8.

(ii) Construct the quadrilateral q(k) and compute the list of distances d = {||q_i - x||}.

(iii) Check the cleanness of q via the Xclean algorithm (described below).

(iv) If q is not clean, consider d and iteratively reduce by half the parameter associated
with the longest cut edge having intersections (which we call e_max).
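Steps (i)-(iv) above can be sketched in Python as follows. This is a minimal illustration under several assumptions: it works directly in the projection plane, it replaces the incremental Xclean bookkeeping with a brute-force test of all four sides at each pass, and it interprets "the longest cut edge having intersections" as the farthest vertex adjacent to an intersecting side. The names `segments_cross` and `clean_quadrilateral` and the cap `max_halvings` are hypothetical. The code uses `math.dist`, available from Python 3.8.

```python
import math
from typing import List, Sequence, Tuple

Point2 = Tuple[float, float]
Segment = Tuple[Point2, Point2]


def _orient(u: Point2, v: Point2, w: Point2) -> float:
    """Twice the signed area of the triangle u, v, w."""
    return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])


def segments_cross(s: Segment, t: Segment) -> bool:
    """True if the two segments cross transversally in their interiors."""
    (p, q), (a, b) = s, t
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def clean_quadrilateral(x: Point2, ends: Sequence[Point2],
                        others: Sequence[Segment], k0: float = 0.8,
                        max_halvings: int = 60):
    """Shrink q(k) around the crossing x until no side meets another edge.

    ends   : endpoints p_1..p_4 of the four cut edges (p_1, p_2 on e_1)
    others : projected edges of the link other than e_1 and e_2
    """
    k = [k0] * 4
    order = (0, 2, 1, 3)          # vertex order 1, 3, 2, 4 of the text, 0-based
    for _ in range(max_halvings + 1):
        q = [(x[0] + ki * (p[0] - x[0]), x[1] + ki * (p[1] - x[1]))
             for ki, p in zip(k, ends)]
        dirty = set()
        for j in range(4):
            i0, i1 = order[j], order[(j + 1) % 4]
            if any(segments_cross((q[i0], q[i1]), e) for e in others):
                dirty.update((i0, i1))
        if not dirty:
            return q, k
        # Halve the parameter of the farthest vertex touching a dirty side.
        worst = max(sorted(dirty), key=lambda i: math.dist(q[i], x))
        k[worst] /= 2.0
    raise RuntimeError("no clean quadrilateral found within max_halvings")
```

Because the projection is regular, no other edge passes through x, so shrinking the quadrilateral toward x eventually clears it; the cap only guards against degenerate input.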

Xclean algorithm Given an oriented n-polygon and a polygonal link, we can construct
an n × 2 table S of the status of the n vertices. Each row of S is a pair summarizing the
intersections of the sides entering and leaving the vertex as follows: we assign 0 if the
relevant side has no intersections with L and 1 otherwise.

Xclean takes as input a quadrilateral, a link projection, a 4 × 2 table S (the putative
status list) and a set indexing the vertices whose relevant sides have to be checked.
The algorithm simply recomputes the indexed rows of S and subsequently updates the
adjacent rows.

2.3.2. Quadrilateral Rotation As a result of the previous algorithm we end up with a
clean quadrilateral q, whose vertices lie on e_1 and e_2. By inserting in L the lift of these
vertices as auxiliary points we would run into technical problems due to parallel edges.
To overcome this problem we generate a new quadrilateral r by rotating q by a suitable
angle α around x (Figure 3C) via the following steps:

(i) Set θ = θ(e_1, e_2) equal to the minimum angle between the vectors e_1 and e_2.

(ii) Initialize α from θ and a tolerance ε, where ε = 0.01 is chosen such that an edge
(e.g. e_1) does not bridge the starting position of the other edge (e.g. e_2).

(iii) Construct r.

(iv) Check the cleanness of r through the Xclean algorithm.

(v) If r is not clean, iteratively reduce α by half until r becomes clean.

Given r we can construct L_r by considering the triangles p_i r_i x (see Figure 3D) and
replacing the original cut edges p_i x with the paths p_i r_i x (two-side replacement), with
i ∈ [1..4].

2.3.3. Topological Check The feasibility of the replacement of L with L_r is not obvious
and requires a careful check, which is accomplished by analyzing the newly introduced
connections. The triangle p r x is subdivided into two triangles by the segment q r. The
absence of intersections in the segments q x and r x is guaranteed by the cleanness of q
and r.

We approve the two-side replacement if and only if:

(i) The edge q r has no intersections.

(ii) The segments p q and p r intersect the same edges of L, preserving intersection order
and signs.

Otherwise the rotation angle α is reduced by half and we loop back to Step2.

2.3.4. Construction of the Skein Configurations The construction of the skein
configurations requires a distinction between L_sw and L_0.

To construct L_sw we initially take the specular image X_sw of the undercross X_1 with
respect to the overcross X_2. By replacing the edge R_1R_2 with the path R_1X_swR_2 we
obtain a switched crossing but the projection is not regular anymore. Thus, we slightly
perturb X_sw by attracting it toward R_1 via the formula

    X_sw ← R_1 + k_sw (X_sw - R_1),    k_sw < 1

The constraint on k_sw guarantees that the projection of R_1X_sw has no intersections
with e_2, while the projection of X_swR_2 has one intersection with e_2 but it is not always an
overpass. If not, we reduce the perturbation via the iterative formula

    k_sw ← (k_sw + 1)/2

whose convergence to 1 guarantees that we will eventually obtain an overpass. We set
the initial value of k_sw to 0.9.
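The perturbation loop just described can be sketched in Python as follows. The geometric test is abstracted into a hypothetical callable `is_overpass`, which is assumed to report whether the segment from the candidate point to R_2 passes over e_2; the function name `relax_switch_point` and the iteration cap `max_iter` are also illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple

Point3 = Tuple[float, ...]


def relax_switch_point(r1: Sequence[float], x_sw: Sequence[float],
                       is_overpass: Callable[[Point3], bool],
                       k0: float = 0.9, max_iter: int = 60):
    """Pull X_sw toward R_1, then relax back until the crossing is an overpass.

    At each step the candidate is R_1 + k * (X_sw - R_1); if the test fails,
    k is moved halfway toward 1, so that 1 - k halves at every iteration.
    """
    k = k0
    for _ in range(max_iter):
        candidate = tuple(r + k * (xs - r) for r, xs in zip(r1, x_sw))
        if is_overpass(candidate):
            return candidate, k
        k = (k + 1.0) / 2.0          # k_sw <- (k_sw + 1)/2
    raise RuntimeError("overpass not reached within max_iter iterations")
```

Since 1 - k halves at each step, the candidate approaches the unperturbed mirror point geometrically, which is why only a handful of iterations is normally needed.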

Given X_sw, to construct L_sw we replace in L the edge P_1P_2 with the path P_1R_1X_swR_2P_2
(see Figure 3E). Notice that the edge P_3P_4 is not affected by this construction.
Instead, the construction of L_0 makes full use of R by substituting in L the edges R_1R_2
and R_3R_4 with the connections R_1R_4 and R_3R_2 (Figure 3F). Obviously, this determines
a shift of the separator indices S and of the numbering of the points following P_1. The
case where e_1 and e_2 belong to the same component of L is treated differently from
the case where they belong to different components. In the former, the number of
components of the link increases, while in the latter it decreases.

2.3.5. Skein recursion We recursively apply the skein relation (1) to reduce a
given polygonal link L to a collection of trivial links, systematically switching the
undercrossings.

We adopt a greedy approach in which at each recursion we switch the undercrossing
leading to the L_sw structure with the lowest number of points and we accordingly
produce the relevant L_0 configuration.

In order to speed up computations, at each step the configurations are reduced with
MSR. The resulting structures are stored as nodes in a skein tree, a binary ordered tree
rooted at the original link.

Our goal is to assign to every node n a pair of weights (s, P), where s(n) is precisely
the skein sign of the crossing of n to be switched and P(n) is the link polynomial of
n. Notice that while s(n) is known, P(n) needs to be computed. We adopt a dynamic
bottom-up procedure in which, starting from the leaves, we attach P(n) to the inner nodes.
Leaves are the simplest nodes since, given a leaf l, P(l) is known a priori, being the
polynomial of the K-component unlink, and there is no undercrossing left (s(l) = 0).
In the skein tree, every inner node L has two children, say L_sw and L_0, and P(L) can
be computed via the recursion formula

    P(L) = a^{-1} b · P(L_sw) + a^{-1} c · P(L_0)    if s(L) = +1
    P(L) = b^{-1} a · P(L_sw) - b^{-1} c · P(L_0)    if s(L) = -1

In this way, the polynomial is simply the weight P of the root.
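The bottom-up evaluation of a skein tree can be sketched in Python for the HOMFLY choice b = a^{-1}, c = z. This is a minimal illustration, not the authors' code: Laurent polynomials in (a, z) are stored as dictionaries mapping exponent pairs to integer coefficients, the normalization P(unknot) = 1 is assumed, and the names `Node`, `unlink` and `homfly` are hypothetical. The tree in the usage example is built by hand for a trefoil diagram whose three crossings are all positive.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Poly = Dict[Tuple[int, int], int]    # (power of a, power of z) -> coefficient


def add(p: Poly, q: Poly, sign: int = 1) -> Poly:
    """Return p + sign * q, dropping zero coefficients."""
    out = defaultdict(int)
    for key, c in p.items():
        out[key] += c
    for key, c in q.items():
        out[key] += sign * c
    return {key: c for key, c in out.items() if c != 0}


def shift(p: Poly, da: int, dz: int) -> Poly:
    """Multiply p by the monomial a^da * z^dz."""
    return {(i + da, j + dz): c for (i, j), c in p.items()}


def unlink(k: int) -> Poly:
    """P of the k-component unlink, ((a - a^-1)/z)^(k-1), assuming P(unknot) = 1."""
    p: Poly = {(0, 0): 1}
    for _ in range(k - 1):
        p = add(shift(p, 1, -1), shift(p, -1, -1), sign=-1)
    return p


@dataclass
class Node:
    sign: int = 0                         # s(n): +1, -1, or 0 for a leaf
    components: int = 1                   # used only by leaves
    switched: Optional["Node"] = None     # child L_sw
    smoothed: Optional["Node"] = None     # child L_0


def homfly(n: Node) -> Poly:
    """Evaluate the recursion formula from the leaves up to the root."""
    if n.sign == 0:
        return unlink(n.components)
    p_sw, p_0 = homfly(n.switched), homfly(n.smoothed)
    if n.sign == 1:
        # P = a^-2 P(L_sw) + a^-1 z P(L_0)
        return add(shift(p_sw, -2, 0), shift(p_0, -1, 1))
    # P = a^2 P(L_sw) - a z P(L_0)
    return add(shift(p_sw, 2, 0), shift(p_0, 1, 1), sign=-1)


# Trefoil with three positive crossings: switching one crossing gives the
# unknot, smoothing it gives a Hopf link, which is resolved the same way.
hopf = Node(sign=1, switched=Node(components=2), smoothed=Node(components=1))
trefoil = Node(sign=1, switched=Node(components=1), smoothed=hopf)
print(homfly(trefoil))   # coefficients of 2a^-2 - a^-4 + a^-2 z^2
```

Storing the sign on each inner node mirrors the pair of weights (s, P) described above, and the recursion visits every node exactly once.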



3. Results and Discussion 



3.1. Validation on tabulated knots and links 

Initially, we validated our methods by computing the HOMFLY polynomial of both
full structures and minimal stick representations of tabulated polygonal knots and
links. We compared our results with a polynomial repository constructed as described
in Text S1. Since standard repositories do not address orientation and chirality,
a single polynomial is associated with a given structure and a computed polynomial
could not directly match repository entries. Thus, for each tabulated structure we
considered mirror images along with all possible orientations (together referred to as
flips) and computed the corresponding polynomials. At least one of them matched
the one reported in the polynomial repository. Our complete repository of knots
up to 10 crossings and oriented links up to 4 components can be browsed at
http://www.pharm.unipmn.it/rinaldi/knots/index.php

As described above, our HOMFLY polynomial computation associates a skein tree with
every knot or link, by means of a greedy selection of the crossing to be switched.
To verify the goodness of this choice we compared it with a fixed choice variant,
which systematically switches the first -1 crossing encountered. We applied both
algorithms to every knotted structure in the repository (including flips), characterizing
each tree with two complexity indices, namely the level (corresponding to the number
of generations, n) and the number of tree nodes k. Figure 4 shows the behavior of k
as a function of n, with dashed curves representing theoretical constraints. The growth
curves of the two algorithms obtained via ANCOVA after linearization are significantly
different, showing that the greedy algorithm generally performs better than the fixed
choice one. This result is also supported by the evidence that the number of levels
and configurations required for polynomial computation is significantly lower for the
greedy choice (Wilcoxon test on the pairwise differences, p < 10~^^). Notably, the 
shrinking of the tree well compensates for the extra computational time required by the
greedy choice, which particularly suggests the use of this algorithm as structure
complexity increases. In general, it is possible to find a time threshold such that, by
filtering computational times accordingly, a significant difference emerges supporting
the greedy choice. This suggested the adoption of the greedy algorithm for the reduction
of protein structures.

Figure 4: The increase of the number of tree nodes as a function of tree
levels. Trees of both greedy (white/black) and fixed choice (gray) algorithms have
been clustered according to the number of levels (n). For each cluster a box plot of
the number of nodes has been drawn with a width proportional to the cluster size. Solid
power curves fit the reported data. Dashed red and blue curves represent respectively
lower and upper estimates of node numbers. Curve expressions are shown in the legend.



3.2. Application to protein structures 

We applied our algorithms to all the protein structures deposited in the PDB. Each
entry was preprocessed as described in Methods and the HOMFLY polynomial was
computed on the MSR-reduced structures.

Globally, we found 119 knotted proteins (226 parts) of the five knot types shown in
Figure 2, belonging to the ten previously well defined classes of knotted foldings [14, 24].

A summary table of knots for each knot type along with the relevant HOMFLY
polynomial is reported in Table 1. For a complete list of knotted protein IDs and
part details see Table S1.

Although redundancies with previous studies [14, 16, 24] are largely present, the number
of knotted proteins is lower than previously reported. This is mainly due to
topological checks and distance controls (see also Text S1) that allowed us to discard
nonstandard PDB formats and entries having large structural gaps due to missing
residues. These proteins are often detected as knotted when gaps are connected by
straight lines, inducing artificial entanglement.

Table 1: Total knotted entries detected for each knot type.

knot type   handedness   #structures   #parts   HOMFLY polynomial
3_1         R            103           184      -l^{-4} + 2l^{-2} + l^{-2}m^2
3_1         L            3             3
4_1                      10            31       -1 + l^{-2} + l^2 - m^2
5_2         L            2             4        l^2 + l^4 - l^6 + (l^2 + l^4)m^2
6_1         R            1             4        l^{-4} - l^{-2} + l^2 - (1 + l^{-2})m^2

Entries show the number of knotted structures and relevant parts for each knot type.

Among the newly detected knotted proteins, two right-handed trefoil knots were identified
in two recently deposited structures. The first one has been found in the human
Carbonic Anhydrase VII (CA7), isoform 1 (3mdz) (see Figure 5A), whereas the second one
has been detected in the uncharacterized ORF from Sulfolobus islandicus rudivirus 1
(2x4i) (Figure 5B), a virus of the extremely thermophilic archaeon Sulfolobus. Notably,
although the latter protein still needs to be fully characterized to define its relevance,
it shares more than 50% of its primary sequence with protein B116 (2j85) of Sulfolobus
turreted icosahedral virus, which King et al. [27] previously reported to contain a slip-
knot. Thus, it is not surprising that the structure of 2x4i also contains a slip-knot,
as we confirmed by visual inspection. Moreover, this protein presents a gap toward
its C-terminus. Since we treat gaps as chain terminators (see Text S1), what we have
detected is the knotted core of the slip-knot, illustrated in Figure 5B. The trefoil knot
in the CA7 belongs instead to the well known right-handed trefoil knotted Carbonic
Anhydrase superfamily. Knotted core analysis, performed as reported in [T2|[l^, reveals 
that both knots have a quite shallow nature. While a trimming of 28 and 5 residues 
from the N-terminus and C-terminus respectively is sufficient to unknot the Carbonic 
Anhydrase VII, the uncharacterized ORF becomes unknotted after an even deletion of 
5 residues. However, this is sufficient to exclude an art if actual nature of these knots. 
For what concerns recently reported trefoil knots, our results confirm the presence 
of a right-handed trefoil knot in the alpha subunit of human S-adenosylmethionine 
synthetase 2 (2p02) and the artifactual origin of the one detected in the ribosomal 80S- 
eEF2-sordarin complex of Saccharomyces cerevisiae (Islh) first reported in |24|. 
Interestingly, we detected three left-handed trefoil knots respectively in the U2 snRNP 
Rds3p protein of S. Cerevisiae (2k0a), VirC2 protein of Agrohacterium tumefaciens 
(2rh3) and in the uncharacterized protein MJ0366 from Methanocaldococcus jannaschii 
(2efv). A fourth knot detected in the human prothrombin complexed with a 
peptidomimetic inhibitor (Ijwt) was discarded due to a long structural gap. The 
left-handed trefoil knot in the Rds3p protein, which highlight a knotted zinc-finger 
motif, is the deepest knot of this kind reported to date |32|. Indeed, its knotted core 
is preserved after trimming of 19 and 18 residues from the C-terminus and the N- 





Figure 5: The two newly identified right-handed trefoil knots in recently
deposited protein structures. (A) Top: the secondary structure and the
accessible surface area (in transparency) of human Carbonic Anhydrase VII, isoform
1 (3mdz). Bottom: a sausage-view cartoon of the same enzyme. In this
representation, the diameter of the sausage is proportional to the B-factor: the
thicker the backbone, the more flexible it is. (B) The same representations as in (A)
are shown for the knotted core of the uncharacterized ORF from Sulfolobus islandicus
rudivirus 1 (2x4i), chain A. Colors change continuously from blue (first residue) to
red (last residue). The last residue of the 2x4i protein is colored in orange, since the
structure presents a gap toward its true C-terminal end and results in a slip-knot when
the whole structure is considered, as detailed in the text.



terminus, respectively. Since this protein does not resemble proteins belonging to the βα
class, it shifts the left-handed to right-handed balance to 4 to 5, thus reinforcing the
hypothesis of no preferential handedness.

3.3. Analysis of the MSR algorithm 

As a secondary goal, we were interested in characterizing an intrinsic feature
of the MSR algorithm, the move lengths. Remarkably, unlike other proposed
reduction schemes, here the move length is not constrained a priori to one (this can
be easily seen in the animated reduction provided as Video S1). This characteristic
leads to a particularly interesting class of curves, which we call reduction curves,
representing the time series of residual points during the reduction process. For example,
Figure 6 illustrates the reduction of the above-mentioned U2 snRNP Rds3p, the relevant
reduction curve and move lengths.

To analyze these two features, 19316 protein structures were randomly extracted from
the PDB. Of these, only proteins whose length lies between the first (37 points) and
the ninth (357 points) deciles of protein lengths were retained (15529 structures).
Proteins were processed with MSR and, at each reduction step, the number of residual
points was associated with the corresponding move length.
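
The length selection described above can be sketched as follows. This is only an illustrative Python sketch, not the Mathematica code used in this work: the function name `filter_by_deciles`, the synthetic `sample_lengths` and the choice of the default interpolation rule of `statistics.quantiles` are all assumptions.

```python
import random
import statistics


def filter_by_deciles(lengths):
    """Keep the lengths lying between the first and ninth deciles (inclusive)."""
    # With n=10, statistics.quantiles returns the nine cut points d1..d9.
    deciles = statistics.quantiles(lengths, n=10)
    low, high = deciles[0], deciles[-1]
    kept = [n for n in lengths if low <= n <= high]
    return kept, low, high


# Hypothetical chain lengths standing in for the randomly extracted PDB sample.
random.seed(0)
sample_lengths = [random.randint(20, 600) for _ in range(1000)]

kept, low, high = filter_by_deciles(sample_lengths)
print(f"first decile = {low:.1f}, ninth decile = {high:.1f}")
print(f"kept {len(kept)} of {len(sample_lengths)} structures")
```

By construction, roughly 80% of the sample survives such a filter, which is in line with the 15529 of 19316 structures retained here.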

We first analyzed the distribution of moves. The observed distribution of move lengths,
shown in Figure 7A, indicates that quite long moves are rather frequent. In particular,
the move length quartiles are 0, 4 and 13, the mean is 8.61 and 27% of the moves have length 0.
We then tested whether move length depends on protein length. Proteins were sorted by length
and the relevant move lengths were grouped in 100 equal-sized bins, so that, for instance,
the first bin contains the moves corresponding to the shortest proteins. As shown in Figure 7B,
the mean of each bin significantly decreases (Mann-Kendall trend test, p < 10~^^) as a
function of protein length. An effect of final moves has been excluded by considering
only the first 90% of the reduction process.
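
The binning and trend test described above can be illustrated with a short Python sketch. It is not the code used in this work: the helper names, the synthetic `move_lengths` series and the use of the normal approximation without tie correction are assumptions made for the sake of the example.

```python
import math


def equal_sized_bin_means(values, n_bins):
    """Split an already ordered list into n_bins near-equal chunks and return the chunk means."""
    n = len(values)  # assumes n >= n_bins, so that no chunk is empty
    means = []
    for b in range(n_bins):
        start, stop = b * n // n_bins, (b + 1) * n // n_bins
        chunk = values[start:stop]
        means.append(sum(chunk) / len(chunk))
    return means


def mann_kendall(series):
    """Return (S, Z, two-sided p) for the Mann-Kendall trend test.

    Uses the normal approximation and ignores the correction for ties.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return s, z, p


# Hypothetical move lengths, ordered by the length of the protein they belong to:
# a slow downward drift plus a small deterministic fluctuation.
move_lengths = [20 - 0.01 * k + (7 * k) % 5 for k in range(1000)]

bin_means = equal_sized_bin_means(move_lengths, 100)
s, z, p = mann_kendall(bin_means)
print(f"S = {s}, Z = {z:.2f}, p = {p:.3g}")
```

A negative S with a small p-value indicates a decreasing trend of the bin means, which is the kind of behavior reported in Figure 7B.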

To assess whether the move length distribution changes during structure reduction, we compared
the move distributions of the first and fourth quartiles of the reduction process. To avoid
overlaps, we considered only reduction sequences of length at least 4 (14346 sequences). A
significant difference between the two quartiles emerged (Wilcoxon test, p < 10~^^), as
highlighted in Figure 7C. Moves with length up to 6 (short moves) are more frequent
toward the end of the reduction process, while long moves occur preferentially in the first
reduction quartile. This behavior is also confirmed by comparing the first and second
halves of the reduction process. However, shorter final moves are in principle explained
by an increase of the mean edge length, as can be seen in Figure 6.

Finally, an interesting effect emerges when the frequencies of move lengths are analyzed
as a function of the residual protein lengths at which they occur. When move
lengths are grouped in quartiles, moves below the median reach their minimum frequency at a
residual length around 60, while the opposite behavior is shown by moves above the median
(Figure 7D). Interestingly, a residual length around 60 is the optimum of the reduction
process, where the frequency of short moves reaches its minimum and, at the same time, the
frequency of long moves is maximal.
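
The comparison between the first and fourth quartiles of the reduction process can be sketched as below. This Python fragment only shows how the two groups of moves may be separated and summarized; the function names, the way a sequence is split into quartiles (integer division of its length by 4) and the toy `sequences` are assumptions, and no statistical test is performed. A rank-based test, such as `scipy.stats.ranksums`, could then be applied to the two groups.

```python
def first_and_fourth_quartile_moves(move_sequences, min_length=4):
    """Pool the moves of the first and of the last quarter of each reduction sequence."""
    first, fourth = [], []
    for moves in move_sequences:
        if len(moves) < min_length:
            continue  # too short to yield two non-overlapping quartiles
        q = len(moves) // 4
        first.extend(moves[:q])
        fourth.extend(moves[-q:])
    return first, fourth


def short_move_fraction(moves, cutoff=6):
    """Fraction of moves whose length does not exceed the cutoff (short moves)."""
    return sum(1 for m in moves if m <= cutoff) / len(moves)


# Hypothetical move-length sequences, one list per reduced structure.
sequences = [
    [30, 22, 15, 9, 6, 4, 2, 0],
    [18, 12, 7, 3],
    [40, 1],  # discarded: fewer than 4 moves
]

first, fourth = first_and_fourth_quartile_moves(sequences)
print("short moves, first quartile: ", short_move_fraction(first))
print("short moves, fourth quartile:", short_move_fraction(fourth))
```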

3.4. Running time and complexity

The computation of the HOMFLY polynomial is known to be NP-hard [9][33] and its
running time increases exponentially with the number of crossings in the projection.
However, applying the MSR algorithm before the polynomial computation
dramatically reduces the number of crossings, making the computation of the
HOMFLY polynomial feasible for every structure analyzed in the present work. Indeed, the MSR
algorithm has complexity 0{N'^) in the number of points (i.e. the number of residues 
Figure 6: MSR reduction curve of the U2 snRNP protein Rds3p. In the middle,
the 13 reduction steps (b-n) of the Rds3p protein (2k0a) (a) are illustrated. The last
frame (n) represents the minimal structure of the protein, a left-handed trefoil knot.
At the top, the residual points are plotted for each frame a-n. The corresponding move
lengths are shown on the right.

for a protein) and represents the dominant term in the total computational time for the
vast majority of the analyzed structures, often independently of their knotted nature.
In practice, running times are reasonable for any analyzed PDB entry on a 2.4 GHz Intel
Core 2 Duo processor with 2 GB of RAM. On average, proteins of length 100, 200 and
300 take 2, 10 and 20 seconds, respectively, to be processed. The identification of the
left-handed trefoil knot in Rds3p (2k0a) requires 2.8 seconds (2.5 seconds for the
MSR algorithm + 0.3 seconds for the polynomial computation), whereas the processing
of the Stevedore's knotted protein (3bjx) takes 23.5 seconds (20 seconds + 3.5 seconds).
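
As a rough illustration of how the average running times quoted above scale with protein length, one can fit a straight line to the three (length, time) pairs in log-log space. The Python sketch below does this with a plain least-squares slope; the function name `loglog_slope` is hypothetical, and an estimate based on three averaged values is only indicative and says nothing rigorous about the theoretical complexity.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Average running times reported in the text: 2, 10 and 20 seconds
# for proteins of length 100, 200 and 300.
slope = loglog_slope([100, 200, 300], [2, 10, 20])
print(f"empirical scaling exponent: {slope:.2f}")
```

For these three values the fitted exponent is close to 2.1, i.e. the reported running times grow slightly faster than quadratically over this length range.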

3.5. Implementation 

All code for this work was written in Wolfram Mathematica 7 and executed on a Mac
OS X platform. We developed the Mathematica package HPKnots.m based on the code
provided as Text S3. HPKnots.m can be obtained upon request. The validation code
also required KnotTheory.m, a third-party Mathematica package (http://katlas.org).
Figure 7: MSR algorithm analysis. (A) The observed distribution of move lengths,
considering only density values greater than 0.2%. (B) The mean move length
significantly decreases as a function of the protein length. Values for each length
percentile are reported. (C) Move length distributions are shown relative to the first
and fourth quartiles of the reduction process, considering only density values greater
than 0.2%. (D) Frequencies of classes of move lengths, as a function of the protein residual
length at which they occur, are plotted as dots. LOESS curves are reported. Class cutoffs were
chosen according to the move length quartiles (0, 4, 13) and the last 5% of residual lengths
were discarded to remove frequency fluctuations.

3.6. Conclusions

We have presented a novel topological framework for the HOMFLY polynomial
computation of polygonal paths based on the geometric construction of Conway skein
triples. Validation on tabulated knots and links demonstrates the overall robustness of
the method and the effectiveness of the greedy selection of the crossing to be switched.
This evidence has been further confirmed by the polynomial computation of protein
structures, which also led to an up-to-date table of knotted structures. While the
topological checks we performed allowed us to discard artificially entangled proteins, two new
right-handed trefoil knots have been detected.

Remarkably, the application range of the presented framework is not limited to proteins
and it can be extended to the topological analysis of biological and synthetic polymers.
In particular, the study of knotted synthetic polymers like polyethylene has led to insights
into the mechanical properties of such structures. The presence of a knot strongly
weakens the polymer, which potentially breaks at the entrance to the knot. Furthermore,
knot frequency depends on the solvent and is higher in the coil phase than in the globular
phase, with a knotted core size that increases as a function of the number of monomers.
These aspects have been previously addressed with the computation of the Alexander
polynomial in numerical simulations based on a simplified model of polyethylene [17].
Our framework can be successfully applied to this model and its possible refinements,
contributing to extend the spectrum of knots considered so far and providing information
about knot chirality. Another suitable field of application of our method, in which
generally more complex knots are investigated, is the topological study of cyclized
DNA [5].

Finally, the presented method is not confined to single-component
structures: it can also be applied to the topological study of multicomponent polygonal
paths, providing a robust identification of knots or links when the frequency of entangled
structures has to be addressed.